Modeling and Fine Tuning

Owner: Daniel Soukup - Created: 2025.11.01

In this notebook, we load the processed data and fit our models, focusing on optimizing variations of XGBoost classifiers across multiple hyperparameters that balance bias and variance, while addressing the class imbalance discussed during EDA. We chose to compare variations of this single model type in depth to allow focus on the details.

NOTE: due to randomness in the model fitting and tuning process, rerunning the notebook may change the outputs (such as the top predictors) and introduce inconsistencies with the current markdown.

Data Loading

Let's load our processed data and create feature/target dataframes for both train and test.

We notice that some special characters can cause issues with our model - we address this here.

Recall that 8% of the processed samples fall into target class 1 (high income), so a dummy classifier that always predicts 0 would be 92% accurate.
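As a quick sanity check of that arithmetic, here is a sketch using synthetic labels drawn at the same positive rate (the labels are placeholders, not the actual data):

```python
import numpy as np

# synthetic labels with ~8% positives, matching the rate in our data
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.08).astype(int)

# a dummy classifier that always predicts 0 is right whenever y == 0
dummy_accuracy = (y == 0).mean()
```

With an 8% positive rate, `dummy_accuracy` lands near 0.92, which is why accuracy alone is a poor metric here.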

Important Note: We won't use the test set for any optimization, to avoid overfitting; we reserve it solely for the final evaluation of the optimized model, as an unbiased estimate of performance on completely unseen data.

Modeling

Our approach focuses on optimizing XGBoost binary classifiers, using Optuna to search the hyperparameter space efficiently. We also address the class imbalance during training by:

Experiment Tracking

We record our models and detailed metrics under two main experiments, each capturing multiple runs.

Fit Baseline

As mentioned, we'll be using sample weights to adjust for the class imbalance:
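A minimal sketch of such a weighting scheme (`make_sample_weights` is a hypothetical helper name; the returned array can be passed as `sample_weight` to the classifier's `fit` call):

```python
import numpy as np

def make_sample_weights(y, multiplier):
    # weight positive-class rows by `multiplier`, negatives stay at 1
    y = np.asarray(y)
    return np.where(y == 1, float(multiplier), 1.0)
```

For example, `make_sample_weights([0, 0, 0, 1], 10)` yields `[1, 1, 1, 10]`, making each positive sample count ten times as much in the loss.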

In order to compare model variations, we need to split the train set into train and validation. For this, we set up our cross-validation helper and define base parameters for our model:
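A sketch of what that setup could look like; the `base_params` dict and `cv_splits` helper name are assumptions for illustration, following XGBoost's sklearn API:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical base parameters, shared by every model variation
base_params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "random_state": 42,
}

def cv_splits(X, y, n_splits=5, seed=42):
    # stratified folds preserve the ~8% positive rate in each split
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(X, y))
```

Stratification matters here: with only 8% positives, plain random folds could leave a validation fold nearly empty of the target class.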

Let's test our function, logging the run:

Let's try with a large multiplier:

We can see that the multiplier has a significant effect on the aucpr score.

We can also test the recommended scale_pos_weight parameter, which helps balance classes. A typical value, per the XGBoost documentation, is sum(negative instances) / sum(positive instances); it assigns a single weight to the whole positive class, independent of the sample, so we should get similar results.
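The recommended value is easy to compute from the labels; a small sketch (the helper name is hypothetical):

```python
import numpy as np

def recommended_scale_pos_weight(y):
    # XGBoost's suggested heuristic: negatives / positives
    y = np.asarray(y)
    return (y == 0).sum() / (y == 1).sum()
```

With our ~8% positive rate this comes out near 0.92 / 0.08 = 11.5, i.e. each positive is implicitly weighted about 11.5x.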

Interestingly, we don't see as much of a difference, so we'll leave this as-is and explore it in the future.

Optimize Hyperparameters - Main Run

Next, we'll look to optimize the model hyperparameters more systematically.

The function below defines the HP space to explore (parameters and their ranges), focusing on 5 such parameters with known strong effect on model performance and regularization:
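A sketch of what such a search-space definition could look like; the specific parameter names and ranges below are illustrative assumptions (following XGBoost's sklearn API), and `trial` is an Optuna-style trial object:

```python
def suggest_params(trial):
    # hypothetical 5-parameter space: capacity (n_estimators, max_depth),
    # step size (learning_rate), and row/column subsampling regularizers
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=100),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 1.0),
    }
```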

Finally, we are ready to run our study, currently consisting of 40 trials:

Let's see the best results:

Tuning Analysis

Let's see how the HP choices impacted performance:

Overall, there is not much variance in the score except for some combinations where likely overfitting occurs (e.g., max_depth and n_estimators both high).

We will look at different projections of the HP space:

The patterns are not fully clear here. We expect the best-performing models to fall in the mid-to-upper range of boosting rounds with a lower max depth (the latter helps avoid overfitting when the number of estimators is high).

In our experiments, high scores often corresponded with smaller column samples (the fraction of columns each estimator uses), unless max depth was significantly lowered. A small column sample again helps avoid overfitting, although the patterns are less clear here.

While the patterns are not the clearest here, we expect high boosting rounds combined with a high sampling fraction to lead to lower scores (the bottom-right corner, likely overfitting again).

Given that some of the best results were observed at the end of the specified search range, it would be a good next step to extend the range further, potentially with a larger step size for boosting rounds.

Finally, we look at the multiplier effect:

On average, the higher the multiplier, the better the aucpr score we got, and we see the most benefit above a weighting of roughly 40. This pattern shows up nicely in the heatmaps above and below as well.

Predict

We compute and save both the predicted class and the predicted probabilities:
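A sketch of that step; the data and model below are stand-ins (the tuned XGBClassifier exposes the same `predict` / `predict_proba` calls through its sklearn API), and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# stand-in data and model for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

preds = pd.DataFrame({
    "pred_class": model.predict(X),
    "pred_proba_1": model.predict_proba(X)[:, 1],  # P(high income)
})
```

Keeping the probabilities alongside the hard labels lets us revisit the decision threshold during evaluation without refitting.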

Interpretation

Finally, let's look at the feature importances for our model (top 20):
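A small sketch of extracting the top importances; `top_importances` is a hypothetical helper that works for any model exposing `feature_importances_` (XGBoost's sklearn wrapper included):

```python
import pandas as pd

def top_importances(model, feature_names, k=20):
    # rank features by importance, descending, and keep the top k
    return (pd.Series(model.feature_importances_, index=feature_names)
              .sort_values(ascending=False)
              .head(k))
```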

Note: rerunning the notebook can significantly change these results. We highlight a few commonalities below.

Observations:

All these findings align with our expectations and EDA. Our model picked up on the gender bias in our data (there are far more high-earning males than females in the dataset), which can definitely be addressed in future model iterations - please see the slides for more info.

79% of high-income earners were male, compared with 46% of low-income earners. This statistical disparity is a strong signal for the model to pick up on and use for classification.

Save predictions

We finally save the results to their own datasets, which can be used for evaluation:
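A minimal sketch of the saving step; the helper name is hypothetical, and CSV keeps the sketch dependency-free (parquet would preserve dtypes better if pyarrow is available):

```python
import pandas as pd

def save_predictions(df: pd.DataFrame, path: str) -> None:
    # persist predictions so the evaluation notebook can load them
    df.to_csv(path, index=False)
```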